Abstract —Human infants can discover words directly from
unsegmented speech signals without any explicitly labeled data. 
The main problem of this paper is to develop a computational 
model that can estimate language and acoustic models, and 
discover words directly from continuous human speech signals 
in an unsupervised manner. For this purpose, we propose an 
integrative generative model that combines a language model 
and an acoustic model into a single generative model called 
the “hierarchical Dirichlet process hidden language model” 
(HDP-HLM). The HDP-HLM is obtained by extending the 
hierarchical Dirichlet process hidden semi-Markov model
(HDP-HSMM) proposed by Johnson et al. An inference procedure
for the HDP-HLM is derived using the blocked Gibbs sampler 
originally proposed for the HDP-HSMM. This procedure enables 
the simultaneous and direct inference of language and acoustic 
models from continuous speech signals. Based on the HDP-HLM 
and its inference procedure, we develop a novel machine learning
method called nonparametric Bayesian double articulation
analyzer (NPB-DAA) that can directly acquire language and 
acoustic models from observed continuous speech signals. By 
assuming HDP-HLM as a generative model of observed time 
series data, and by inferring latent variables of the model, the 
method can analyze latent double articulation structure, i.e., 
hierarchically organized latent words and phonemes, of the data 
in an unsupervised manner. We also carried out two evaluation 
experiments using synthetic data and actual human continuous 
speech signals representing Japanese vowel sequences. In the 
word acquisition and phoneme categorization tasks, the NPB-DAA
outperformed a conventional double articulation analyzer
(DAA) and a baseline automatic speech recognition system whose
acoustic model was trained in a supervised manner. The main
contributions of this paper are as follows: (1) We develop 
a probabilistic generative model that integrates language and 
acoustic models, i.e., HDP-HLM. (2) We derive an inference 
method for this, and propose the NPB-DAA. (3) We show that the 
NPB-DAA can discover words directly from continuous human 
speech signals in an unsupervised manner. 

Index Terms —Language acquisition, child development, 
Bayesian nonparametrics, latent variable model 

I. INTRODUCTION 

INFANTS must solve the word segmentation problem in
order to acquire language from continuous speech signals 
to which they are exposed. The word segmentation problem 
is that of identifying word boundaries in continuous speech. 
If the speech signals are given to infants as isolated words, 
the task is easy for them. It is known, however, that only
a relatively small number of infant-directed utterances consist
of an isolated word [1]. If infants had innate knowledge of words
and phonemes, the problem could be solved relatively
easily. Yet the fact that each language has different
lists of phonemes and words clearly shows that infants have
to acquire them through developmental processes.

From the viewpoint of statistical learning, the learning 
problem, i.e., direct language acquisition from continuous 
speech signals, is very difficult because infants do not have 
access to the truth labels of speech recognition results. In other 
words, the language acquisition process must be completely 
unsupervised. The main problem of this paper is to develop a
computational model that can estimate language and acoustic 
models, and discover words directly from continuous human 
speech signals. 

Most modern automatic speech recognition (ASR) systems 
have a language model that represents knowledge about words 
and their distributional probabilities as well as an acoustic 
model that represents knowledge about phonemes and their 
acoustic features, e.g., [?], [?]. Both are usually trained using
large transcribed speech datasets and linguistic corpora
through supervised learning. However, infants do not have 
access to such explicitly labeled datasets. They have to acquire 
both language and acoustic models from raw acoustic speech 
signals in an unsupervised manner. 

This raises the question of what kinds of cues human infants
use to discover words from continuous speech signals.
Saffran et al. listed three types of cues for word segmentation:
1) prosodic, 2) distributional, and 3) co-occurrence [?].
1) Prosodic cues rely on acoustic information, such as
post-utterance pauses, stressed syllables, and acoustically
distinctive final syllables. 2) Distributional cues represent the
statistical relationships between pairs of neighboring speech sounds.
3) Co-occurrence cues are used by children to learn words
by detecting sounds that co-occur with certain entities in the
environment. Although many researchers had considered the
distributional cues to be too complex for infants to use, Saffran
reported that word segmentation from fluent speech can be
accomplished by 8-month-old infants based solely on
distributional cues [?]. It has also been reported that distributional
cues seem to be used by infants by the age of 7 months, which
is earlier than most other cues [?]. These results imply that
infants have a fundamental mechanism that can estimate word
segments using distributional cues. In addition to this
fundamental segmentation mechanism using distributional cues,
the prosodic and co-occurrence cues are believed to help the
word segmentation task only as supplemental cues [?]. From
the viewpoint of phonemic category acquisition, distributional
patterns of sounds have also been considered to provide infants
with clues about the phonemic structure of a language [?].

Based on these findings, in this paper, we focus on
distributional cues. We explore the fundamental computational
mechanism that can discover words from speech signals using
only distributional cues, and develop an unsupervised machine
learning method that can discover phonemes and words
directly from unsegmented speech signals.

In this paper, we propose an unsupervised learning method
called the nonparametric Bayesian double articulation analyzer
(NPB-DAA) that can automatically estimate double articulation
structures, i.e., hierarchically organized latent words and
phonemes, embedded in speech signals. We propose this as a
computationally valid explanation for the simultaneous
acquisition of language and acoustic models. To develop the
NPB-DAA, we introduce a probabilistic generative model called the
hierarchical Dirichlet process hidden language model
(HDP-HLM) as well as its inference algorithm.

The remainder of this paper is organized as follows.
Section II describes the background of the proposed method.
Section III presents the HDP-HLM by extending the hierarchical
Dirichlet process hidden semi-Markov model (HDP-HSMM)
proposed by Johnson et al. [6]. The HDP-HLM is a
probabilistic generative model that integrates acoustic and language
models for continuous speech signals. Section IV describes the
inference procedure of the HDP-HLM and our proposed
NPB-DAA. Sections V and VI evaluate the effectiveness of the
proposed method using synthetic data and actual sequential
vowel speech signals. Section VII concludes this paper.

II. Background 

A. Word segmentation using distributional cues in transcribed 
data 

With respect to statistical computational models, many
kinds of unsupervised machine learning methods for word
segmentation have been proposed in the last two decades
[7]-[15]. Brent [7] proposed model-based dynamic programming
1 (MBDP-1) for recovering deleted word boundaries in a
natural-language text. The MBDP-1 presumes that there is
an information source generating the text explicitly and
segments the target text so as to maximize the text’s
probability. Venkataraman proposed a statistical model for
segmentation and word discovery from phoneme sequences
by improving Brent’s algorithm.

Recently, Bayesian nonparametrics, including the hierarchical
Dirichlet process and hierarchical Pitman-Yor process,
have enabled more sophisticated methods for word
segmentation. These models have fully Bayesian generative
models and make it possible to calculate the appropriately
smoothed n-gram probability for a word that has a long
context. Theoretically, they can treat an infinite number of
possible words. Goldwater [9], [10] proposed an HDP-based
word segmentation method and showed that taking context
into account is important for statistical word segmentation.
Mochihashi et al. [11] proposed a nested Pitman-Yor language
model (NPYLM), in which a letter n-gram model based on a
hierarchical Pitman-Yor language model is embedded in the
word n-gram model. They also developed the forward filtering
backward sampling procedure to achieve efficient blocked
Gibbs sampling and hence infer word boundaries.

However, all of the above-mentioned word segmentation
methods presume that transcribed phoneme sequences or text
data without any recognition errors can be obtained by the
learning system. In practice, before acquiring a language
model containing an inventory of words, a learning system,
i.e., an infant, has to recognize speech signals without any
knowledge of words, using only the knowledge of phonemes
and/or syllables in an acoustic model. In such a recognition
task, the phoneme recognition error rate inevitably becomes
high. To overcome this problem, several researchers have
proposed word discovery methods utilizing co-occurrence cues.

B. Lexical acquisition using co-occurrence cues 

Roy et al. [16] ambitiously implemented a computational
model that enables a robot to autonomously discover words
from raw multimodal sensory input. Their results were
imperfect compared with recent state-of-the-art results. However, their
results showed it was possible to develop cognitive models that
can process raw sensor data and acquire a lexicon without the
need for human transcription or labeling.

Iwahashi et al. [17] implemented an interactive learning
method for a robot to acquire spoken words through
human-robot interaction using audio-visual interfaces. Their learning
process was carried out on-line, incrementally, actively, and
in an unsupervised manner. Iwahashi et al. [18] also proposed
a method that enables a robot to learn linguistic knowledge
through human-robot communication in an unsupervised
manner. The model combines speech, visual, and behavioral
information in a probabilistic framework. Though its
performance was still limited, the model is considered to be a more
sophisticated model than that proposed in Roy et al.’s previous
study [16] from the viewpoint of statistical machine learning.
On the basis of this work, Iwahashi et al. [19] developed an
integrated online machine learning system combining speech,
visual, and tactile information obtained through interaction. It
enabled robots to learn beliefs regarding speech units, words,
the concepts of objects, motions, grammar, and pragmatic and
communicative capabilities. They called the system LCore.

Araki et al. [20] built a robot that formed object categories
and acquired their names by combining a multimodal latent
Dirichlet allocation (MLDA) and the NPYLM. They showed
that the iterative learning of MLDA and NPYLM increases
word segmentation performance by using distributional cues
and co-occurrence cues simultaneously, but they reported that
the prediction accuracy decreases as the phoneme recognition
error rate increases. To overcome this problem, Nakamura et
al. integrated statistical models for word segmentation and
multimodal categorization. They showed that a robot can
autonomously form object categories and related words from
continuous speech signals and continuous visual, auditory, and
haptic information by updating its language and categorization
models iteratively [21].

Not only object information, but also place information can
be used as co-occurrence cues. Taguchi et al. [22] proposed
a method for the unsupervised learning of place-names from
information pairs that consist of spoken utterances and the
mobile robot’s estimated current location without any prior
linguistic knowledge other than a phoneme acoustic model.
They optimized a word list using a model selection method
based on the description length criterion.

C. Word segmentation using distributional cues in noisy input 

As described above, using co-occurrence cues
can mitigate the ill effects of phoneme recognition
errors in a word discovery task. However, whether or
not the word discovery task can be achieved solely from raw
speech signals is still an open question. Neubig et al. [24]
extended the unsupervised morphological analyzer proposed
by Mochihashi et al. and enabled it to analyze phoneme
lattices. Heymann et al. [25] modified Neubig et al.’s algorithm
and proposed a suboptimal two-stage algorithm. Heymann
et al. reported that their proposed method outperformed the
original method in an experiment that used lattice input
generated artificially from text input. In addition, they used
the discovered language model for phoneme recognition in
an iterative manner and reported that recognition performance
was improved [26]. Eisner et al. [27] proposed a computational
model that jointly performs word segmentation and learns an
explicit model of phonetic variation. However, they did not
start with acoustic sound, but with dictated noisy text, i.e.,
recognized phoneme sequences with errors. Their model does
not include acoustic model learning.

These studies showed that the ill effect of phoneme recognition
errors can be mitigated to some extent by using distributional
information more appropriately. However, all of these
methods, except for that of Iwahashi et al., used an acoustic model
previously trained in a supervised manner. Therefore, these
models are insufficient as a constructive model for language
acquisition from raw speech signals. Hence, the unsupervised
learning of an acoustic model is also an important problem.

D. Unsupervised learning of an acoustic model 

In contrast with the word segmentation task, the acquisition
of an acoustic model is basically a categorization task of the
feature vectors transformed from continuous speech signals.
Mixture models, including hidden Markov models (HMMs)
and Gaussian mixture models, have been used to model
phoneme category acquisition. For example, Lake et al. [28]
used an online mixture estimation model for vowel category
learning. The model was originally proposed by Vallabha et
al. [?]. However, phoneme acquisition has proven to be a
complex categorization task in a feature space. The distributions
of the feature vectors of the phonemes overlap with each other,
and the actual sound of a phoneme depends on its context.
Feldman et al. [29] pointed out that feedback information
from segmented words is important for phonetic category
acquisition. They demonstrated this effect through simulations
using Bayesian models.

Lee et al. [30] proposed a hierarchical Bayesian model that
can discover a proper set of sub-word units and an acoustic
model in an unsupervised manner. However, their model did
not estimate the language model. Lee et al. [31] also proposed
a hierarchical Bayesian model simultaneously discovering the
phonetic inventory and the Letter-to-Sound mapping rules
on the basis of transcribed data only. The method is not a
completely unsupervised learning method from raw speech
signals, but does automatically determine relations between
sounds and transcribed alphabets and forms an acoustic model
in an unsupervised manner.

There have been several studies about the simultaneous
unsupervised learning of acoustic and language models.
However, only a very small number of statistical learning methods that
can simultaneously acquire integrated acoustic and language
models have been proposed. Brandi et al. [32] attempted
to develop an unsupervised learning method that enables a
robot to simultaneously obtain phonemes, syllables, and words
from acoustic speech. They did not successfully build such a
system, but reported their preliminary results. Walter et al. [33]
proposed a word discovery method that uses an HMM-based
method for finding acoustic unit descriptors in parallel with a
dynamic time warping technique for finding word segments.
However, their model is still heuristic from the viewpoint of
probabilistic computational models. As Feldman et al. pointed
out, word segmentation and phonetic category acquisition are
undoubtedly mutually dependent. Therefore, a theoretically
integrated probabilistic generative model for the simultaneous
acquisition of language and acoustic models is desirable.
Very recently, Kamper et al. [?] and Lee et al. [?] proposed
probabilistic computational models that achieved unsupervised
direct word discovery from continuous speech signals.
However, they did not provide an explicit, integrated probabilistic
generative model for unsupervised simultaneous learning of
language and acoustic models. To develop such an integrated
theoretical model, the authors introduced the general concept
of double articulation analysis.

E. Double articulation analysis 

From a general point of view, unsupervised word discovery 
from raw speech signals is regarded as a double articulation 
analysis of the time series data representing a speech signal. 
The double articulation structure is a well-known two-layer 
hierarchical structure, i.e., a word sequence is generated from 
a language model, a word is a sequence of phonemes, and each 
phoneme outputs observation data during the period it persists. 
The word discovery problem becomes a general problem about 
analyzing the time series data that potentially have a double 
articulation structure by estimating the latent acoustic model 
as well as the latent language model. 

Taniguchi et al. [34] proposed a double articulation analyzer
(DAA) by combining the sticky HDP-HMM and the
NPYLM. The sticky HDP-HMM proposed by Fox et al. is
a nonparametric Bayesian extension of the HMM [35]. Taniguchi
et al. applied the DAA to human motion data to extract unit motions
from unsegmented human motion data. However, they simply
used the two nonparametric Bayesian methods sequentially
and did not integrate the two models into a single generative
model. Therefore, if there are many recognition or categorization
errors in the result of the first latent letter recognition
process, i.e., the segmentation process by the sticky HDP-HMM,
the performance of the subsequent process, i.e., unsupervised
chunking by the NPYLM, deteriorates. In the terminology of
a DAA, a latent letter and a latent word basically correspond
to a phoneme and a word in speech signals, respectively. In
this paper, we call this method “conventional DAA” in order
to differentiate it from the DAA newly proposed in this paper,
i.e., NPB-DAA. Conventional DAA has been successfully
applied to human motion data and driving behavior data,
which were also considered to potentially have a double
articulation structure. Conventional DAA has been used for
various purposes, e.g., segmentation [36], prediction [37], [38],
data mining [39], topic modeling [40], [41], and video
summarization [42]. Conventional DAA owes its successful results
with respect to driving behavior data to the fact that driving
behavior data are continuous and smooth compared with raw
speech signals. For a driving letter, which corresponds to a
phoneme in continuous speech signals, the recognition error
rate remained low. However, it is expected that a straightforward
application of the conventional DAA to raw speech signals will
perform poorly.

Fig. 1. Overview of unsupervised learning of language and acoustic models through human-robot interaction, and the generative process of speech signal
assumed in the DAA

Therefore, based on the background mentioned above, in
this paper, we propose an integrated probabilistic generative
model, HDP-HLM, representing a latent double articulation
structure that contains both a language model and an acoustic
model. By assuming HDP-HLM as a generative model of
observed time series data, and by inferring latent variables of
the model, we can analyze latent double articulation structure
of the data in an unsupervised manner. A novel double
articulation analyzer is developed on the basis of the HDP-HLM
and its inference algorithm. This HDP-HLM-based double
articulation analysis method is called NPB-DAA.


III. Generative model 


In this section, we propose a novel generative model, the
HDP-HLM, for time series data that potentially have a double
articulation structure, by extending the HDP-HSMM [6]. As
indicated by its name, the HDP-HLM latently contains a language
model. In contrast with the conventional case, where a latent
state transits to the next state on the basis of a Markov process
in the HDP-HMM, a latent word in the HDP-HLM transits to
the next latent word on the basis of a language model. An
illustrative overview of the proposed method and the target
task is shown in Fig. 1. We can naturally derive an inference
procedure for the HDP-HLM based on the blocked Gibbs
sampler. First, we briefly describe the HDP-HSMM. We then
describe the HDP-HLM.


A. HDP-HSMM 


Fig. 2. Model of the HDP-HSMM [6]

The HDP-HSMM is a nonparametric Bayesian extension of
the conventional hidden semi-Markov model (HSMM) [?],
[43]. Unlike the HDP-HMM, which is a nonparametric Bayesian
extension of the conventional hidden Markov model (HMM) [35],
[?], the HDP-HSMM explicitly models the duration
of a hidden state. A graphical model of the HDP-HSMM is
shown in Fig. 2. The generative process of the HDP-HSMM
is described as follows.

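\begin{align}
\beta &\sim \mathrm{GEM}(\gamma) \tag{1} \\
\pi_i &\sim \mathrm{DP}(\alpha, \beta), & i &= 1, 2, \ldots, \infty \tag{2} \\
(\theta_i, \omega_i) &\sim H \times G, & i &= 1, 2, \ldots, \infty \tag{3} \\
z_s &\sim \pi_{z_{s-1}}, & s &= 1, 2, \ldots, S \tag{4} \\
D_s &\sim g(\omega_{z_s}) \tag{5} \\
x_t &= z_s, & t &= t_s^1, t_s^1 + 1, \ldots, t_s^2 \tag{6} \\
y_t &\sim h(\theta_{x_t}) \tag{7} \\
t_s^1 &= \sum_{s' < s} D_{s'} + 1 \tag{8} \\
t_s^2 &= t_s^1 + D_s - 1 \tag{9}
\end{align}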

where GEM and DP represent the stick-breaking process and the
Dirichlet process, respectively [?], [?]. The parameters $\gamma$
and $\alpha$ are hyperparameters of the DP, $\beta$ is a global transition
probability that becomes the base measure of the transition
probability distributions, and $\pi_i$ is a transition probability
distribution related to the $i$-th super state. Variable $z_s$ is the
$s$-th super state in the sequence of super states, $D_s$ is the frame
duration of $z_s$, and the variables $x_t$ and $y_t$ are a hidden state
and an observation at time frame $t$, respectively. Parameters of
an emission distribution and a duration distribution for the
$i$-th super state are described as $\theta_i$ and $\omega_i$. Additionally, $H$ and
$G$ are base measures for the emission distribution and the duration
distribution. The functions $h$ and $g$ represent the emission and
duration distributions, respectively. The time frames $t_s^1$ and
$t_s^2$ are the frames corresponding to the start point and the end point
of the segment corresponding to $z_s$.
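
The stick-breaking construction behind the GEM prior in (1) can be illustrated with a short Python sketch. It draws only the first few weights of the global transition probability, so it is a truncated approximation of the infinite process. The function name `truncated_gem`, the truncation level `n_sticks`, and the value of the concentration parameter are illustrative assumptions and are not taken from the paper.

```python
import random


def truncated_gem(gamma, n_sticks, rng):
    """Draw the first n_sticks weights of a GEM(gamma) sample by stick breaking.

    Returns the weights and the mass left on the (untruncated) rest of the stick.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, gamma)  # share broken off the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining


rng = random.Random(0)  # fixed seed only for reproducibility of the sketch
beta, leftover = truncated_gem(gamma=2.0, n_sticks=20, rng=rng)
print(len(beta), sum(beta) + leftover)
```

A larger concentration parameter spreads the mass over more sticks, which corresponds to a prior preference for a larger number of super states.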

In contrast with the HMM, which assumes that a hidden
state $x_t$ transits to the next hidden state $x_{t+1}$ according to
a Markov process, the hidden semi-Markov model (HSMM)
assumes that a hidden super state $z_s$ transits to the next hidden
super state $z_{s+1}$ after a probabilistically determined duration
$D_s$, which is sampled from a duration distribution $g(\omega_{z_s})$.
The super state $z_s$ is sampled from a categorical distribution
$\pi_{z_{s-1}}$ related to the previous super state $z_{s-1}$. When the super
state $z_s$ and duration $D_s$ are sampled, the sequence of hidden
states $\{x_t \mid 1 + \sum_{s' < s} D_{s'} \le t \le \sum_{s' \le s} D_{s'}\}$ is determined to be
$z_s$.

An observation datum $y_t$ at time $t$ is assumed to be drawn
from an emission distribution $h$ whose parameter is $\theta_{x_t}$.
Observation data $y_t$ are generated by $h(\theta_{x_t})$ for $D_s$ steps.
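
The generative process described above can be sketched in a few lines of Python. This is a minimal sketch under several assumptions that are not part of the paper: the number of super states is fixed and small instead of infinite, durations are drawn as 1 + Poisson so that no segment is empty, and emissions are one-dimensional Gaussians. All names (`sample_hsmm`, `pi`, `init`, `omega`, `theta`) are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_categorical(probs, rng):
    u, acc = rng.random(), 0.0
    for index, p in enumerate(probs):
        acc += p
        if u < acc:
            return index
    return len(probs) - 1  # guard against rounding in the cumulative sum


def sample_hsmm(pi, init, omega, theta, n_segments, rng):
    """Return super states z, durations D, frame labels x, and observations y."""
    z, D, x, y = [], [], [], []
    prev = None
    for _ in range(n_segments):
        state = sample_categorical(init if prev is None else pi[prev], rng)
        duration = 1 + sample_poisson(omega[state], rng)  # assumption: shifted Poisson
        mean, std = theta[state]
        z.append(state)
        D.append(duration)
        for _ in range(duration):          # every frame in the segment shares x_t = z_s
            x.append(state)
            y.append(rng.gauss(mean, std))
        prev = state
    return z, D, x, y


pi = [[0.0, 0.7, 0.3], [0.5, 0.0, 0.5], [0.6, 0.4, 0.0]]   # hypothetical transitions
init = [1 / 3, 1 / 3, 1 / 3]
omega = [4.0, 2.0, 6.0]                                    # Poisson rates per super state
theta = [(-2.0, 0.5), (0.0, 0.5), (3.0, 0.5)]              # (mean, std) per super state
z, D, x, y = sample_hsmm(pi, init, omega, theta, n_segments=5, rng=random.Random(1))
print(z, D, len(y))
```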

An efficient sampling inference procedure based on the
backward filtering forward sampling technique was proposed
for constructing a blocked Gibbs sampler [6]. A similar
algorithm was proposed for the HDP-HMM by Fox et al. [35].
The algorithm is derived from a weak-limit approximation of
the number of hidden super states. The computational cost of
the message passing algorithm can be reduced to $O(T d_{\max} N^2)$,
where $T$ is the length of the observed data, $N$ is the state
cardinality, and $d_{\max}$ is the maximal duration of a super state
for truncation. The order is almost the same as that of the
backward filtering forward sampling algorithm for the
HDP-HMM, except for the constant factor $d_{\max}$.

B. HDP-HLM 

The generative model for time series data that potentially
have a double articulation structure can be obtained
by extending the HDP-HSMM. A graphical model of the
proposed HDP-HLM is shown in Fig. 3. In the generative
model of the HDP-HLM, the super state $z_s$ corresponds to a word
in spoken language, which is the fundamental idea of the
extension. The $i$-th super state $z_s = i$ has a phoneme sequence
$w_i = (w_{i1}, \ldots, w_{ik}, \ldots, w_{iL_i})$, where $L_i$ is the length of the $i$-th
word $w_i$. The generative process of the HDP-HLM is described
as follows.


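\begin{align}
\beta^{LM} &\sim \mathrm{GEM}(\gamma^{LM}) \tag{10} \\
\pi_i^{LM} &\sim \mathrm{DP}(\alpha^{LM}, \beta^{LM}), & i &= 1, 2, \ldots, \infty \tag{11} \\
\beta^{WM} &\sim \mathrm{GEM}(\gamma^{WM}) \tag{12} \\
\pi_j^{WM} &\sim \mathrm{DP}(\alpha^{WM}, \beta^{WM}), & j &= 1, 2, \ldots, \infty \tag{13} \\
w_{ik} &\sim \pi^{WM}_{w_{i,k-1}}, & i &= 1, 2, \ldots, \infty,\ k = 1, 2, \ldots, L_i \tag{14} \\
(\theta_j, \omega_j) &\sim H \times G, & j &= 1, 2, \ldots, \infty \tag{15} \\
z_s &\sim \pi^{LM}_{z_{s-1}}, & s &= 1, 2, \ldots, S \tag{16} \\
l_{sk} &= w_{z_s k}, & s &= 1, 2, \ldots, S, \tag{17} \\
& & k &= 1, 2, \ldots, L_{z_s} \tag{18} \\
D_{sk} &\sim g(\omega_{l_{sk}}), & s &= 1, 2, \ldots, S, \tag{19} \\
& & k &= 1, 2, \ldots, L_{z_s} \tag{20} \\
x_t &= l_{sk}, & t &= t_{sk}^1, \ldots, t_{sk}^2 \tag{21} \\
t_{sk}^1 &= \sum_{s' < s} D_{s'} + \sum_{k' < k} D_{sk'} + 1 \tag{22} \\
t_{sk}^2 &= t_{sk}^1 + D_{sk} - 1 \tag{23} \\
y_t &\sim h(\theta_{x_t}), & t &= 1, 2, \ldots, T \tag{24}
\end{align}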


where $\beta^{WM}$ is the base measure and $\alpha^{WM}$ and $\gamma^{WM}$ are
hyperparameters of a word model, which generates words, i.e.,
latent letter sequences. Furthermore, $\mathrm{DP}(\alpha^{WM}, \beta^{WM})$ outputs
$\pi_j^{WM}$, representing the transition probability from latent letter
$j$ to the next latent letter. By contrast, $\beta^{LM}$ is the base measure,
$\alpha^{LM}$ and $\gamma^{LM}$ are hyperparameters of the language model,
and $\mathrm{DP}(\alpha^{LM}, \beta^{LM})$ outputs $\pi_i^{LM}$, representing the transition
probability from latent word $i$ to the next latent word. The
superscripts LM and WM indicate the language model (LM) and the
word model (WM), respectively. The latent letters contained
in the $i$-th latent word $w_i$ are sequentially sampled from $\pi^{WM}$.
The $k$-th latent letter of the $i$-th latent word is represented by
$w_{ik}$. The emission distribution $h$ and the duration distribution
$g$ have parameters $\theta_j$ and $\omega_j$ for the $j$-th latent letter,
respectively. The base measures $H$ and $G$ generate $\theta_j$ and $\omega_j$,
respectively. Variable $z_s$ is the $s$-th latent word in the sequence
of latent words, and corresponds to the super state in the
HDP-HSMM, $D_s$ is the frame duration of $z_s$, $l_{sk} = w_{z_s k}$ is the
$k$-th latent letter of the $s$-th latent word, and $D_{sk}$ is the frame
duration of $l_{sk}$. The variables $x_t$ and $y_t$ are a hidden state and
an observation at time frame $t$, respectively. The time frames
$t_{sk}^1$ and $t_{sk}^2$ are the frames corresponding to the start point and the end
point of the segment corresponding to $l_{sk}$, respectively.

Fig. 3. Model of the proposed HDP-HLM

In contrast with HMMs, the duration distribution is explicitly
determined for each latent letter $l_{sk}$ in the HDP-HLM. The
HDP-HLM inherits this property from the HDP-HSMM [6].
The duration $D_{sk}$ of latent letter $l_{sk}$, which is the $k$-th
latent letter of the $s$-th latent word $z_s$ in a sampled word
sequence, is drawn from the duration distribution $g(\omega_{l_{sk}})$, where
$\omega_{l_{sk}}$ is the duration parameter for latent letter $l_{sk}$. The duration
of a latent word $w_{z_s}$ becomes $D_s = \sum_{k=1}^{L_{z_s}} D_{sk}$. If we assume
that $g$ is a Poisson distribution, the duration distribution of a
latent word $z_s$ also follows a Poisson distribution. In this case,
the Poisson parameter of the duration distribution becomes
$\sum_{k=1}^{L_{z_s}} \omega_{l_{sk}}$. This relation owes to the reproductive property of
Poisson distributions.
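
The reproductive property can be checked numerically. The following Python sketch convolves the Poisson duration distributions of the letters in one word and compares the result with a single Poisson distribution whose parameter is the sum of the letter parameters. The rates in `letter_rates` are hypothetical values chosen only for illustration, and the helper names are not from the paper.

```python
import math


def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def summed_duration_pmf(d, rates):
    """P(D_s = d) when D_s is the sum of independent Poisson letter durations."""
    pmf = [1.0] + [0.0] * d  # distribution of an empty sum: all mass at zero
    for lam in rates:
        # Discrete convolution with the next letter's duration distribution,
        # truncated at d because larger totals cannot contribute to P(D_s = d).
        pmf = [
            sum(pmf[j] * poisson_pmf(n - j, lam) for j in range(n + 1))
            for n in range(d + 1)
        ]
    return pmf[d]


letter_rates = [3.0, 5.5, 1.5]  # hypothetical duration parameters of a three-letter word
max_gap = max(
    abs(summed_duration_pmf(d, letter_rates) - poisson_pmf(d, sum(letter_rates)))
    for d in range(25)
)
print(max_gap)  # difference is at the level of floating-point rounding
```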

In the HDP-HLM, latent word $z_s$ determines a latent letter
sequence $l_{sk} = w_{z_s k}$ $(k = 1, 2, \ldots, L_{z_s})$. Based on the determined
sequence $w_{z_s}$, the duration $D_{sk}$ of $l_{sk}$ is drawn, and observations $y_t$
are drawn from an emission distribution $h(\theta_{x_t})$ corresponding
to $x_t = l_{s(t)k(t)}$. The maps $s(t)$ and $k(t)$ represent the indices
of words and letters, respectively, in a latent word sequence
at time $t$. Using this generative model, continuous time
series data with a latent double articulation structure can be
generated. In this paper, we assume that the observed time series
datum $y_t$ represents a feature vector of the speech signal at time
$t$ and is generated in this way. Generally, the HDP-HLM can
be applied to any kind of time series data that has a double
articulation structure.
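
The two-layer generation described above can be sketched as follows. This is a simplified, finite illustration and not the authors' implementation: the dictionary `words` is fixed in advance instead of being drawn from the word model, the language model is a small hand-written bigram table, durations are 1 + Poisson to avoid empty segments, and emissions are one-dimensional Gaussians. All names and parameter values are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_categorical(probs, rng):
    u, acc = rng.random(), 0.0
    for index, p in enumerate(probs):
        acc += p
        if u < acc:
            return index
    return len(probs) - 1


def sample_double_articulation(words, pi_lm, init, omega, theta, n_words, rng):
    """Generate latent words z, (letter, duration) segments, frame labels x, data y."""
    z, segments, x, y = [], [], [], []
    prev = None
    for _ in range(n_words):
        word = sample_categorical(init if prev is None else pi_lm[prev], rng)
        z.append(word)
        for letter in words[word]:                      # l_sk = w_{z_s k}
            duration = 1 + sample_poisson(omega[letter], rng)
            segments.append((letter, duration))
            mean, std = theta[letter]
            for _ in range(duration):                   # x_t = l_sk within the segment
                x.append(letter)
                y.append(rng.gauss(mean, std))
        prev = word
    return z, segments, x, y


words = [[0, 1], [2, 0, 1], [1, 2]]                     # latent words as letter sequences
pi_lm = [[0.1, 0.6, 0.3], [0.5, 0.1, 0.4], [0.4, 0.5, 0.1]]
init = [0.4, 0.3, 0.3]
omega = [3.0, 5.0, 2.0]                                 # duration rate per latent letter
theta = [(-3.0, 0.4), (0.0, 0.4), (3.0, 0.4)]           # emission (mean, std) per letter
z, segments, x, y = sample_double_articulation(
    words, pi_lm, init, omega, theta, n_words=4, rng=random.Random(2))
print(z, segments, len(y))
```

Only `y` would be visible to a learner; the inference problem treated in the next section is to recover `z`, the segments, and the model parameters from `y` alone.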

From the viewpoint of language acquisition, we review the
generative model. The conventional DAA [34] is
composed of two separate machine learning methods, i.e., the
sticky HDP-HMM for encoding observation data to letter
sequences and the NPYLM for chunking letter sequences into
word sequences. On the one hand, the transition probabilities
$\pi_i^{LM}$ and $\pi_j^{WM}$ correspond to the word bigram and letter bigram
models in the NPYLM, respectively. Therefore, $(\pi^{LM}, \pi^{WM})$
contains information regarding a language model. On the
other hand, $\{\omega_j, \theta_j\}_{j = 1, 2, \ldots, \infty}$ contains information regarding
an acoustic model, which corresponds to the sticky HDP-HMM
in the conventional DAA.

The HDP-HLM assumes that the language model consists of
a word bigram model. Mochihashi et al. compared the bigram
and trigram language models and showed that the trigram
assumption hardly improved the word segmentation performance
although computational cost and complexity increased [11].
Therefore, the bigram assumption is expected to be appropriate for a
word segmentation and word discovery task.

If we derive an efficient inference procedure for this
two-layer hierarchical generative model, the inference procedure
can infer the acoustic model and language model simultaneously.

IV. Inference algorithm 

In this section, we derive an approximate blocked Gibbs
sampler for the HDP-HLM. The sampler can simultaneously
infer latent letters, latent words, a language model, and an
acoustic model. In doing so, the inference procedure
estimates the overall double articulation structure from continuous
time series data. On this basis, we propose the unsupervised
machine learning method NPB-DAA. The overall inference
procedure is shown in Algorithm 1.

A. Inference of latent words: z s 

In the HDP-HSMM, a backward filtering forward sampling
procedure is adopted instead of the direct assignment
procedure. When each latent state strongly depends on other
neighboring latent states, the direct assignment procedure,
which is a naive implementation of the Gibbs sampler, results
in a poor mixing rate [6]. Johnson et al. showed that a blocked
Gibbs sampler using a backward filtering forward sampling
procedure that can simultaneously sample all hidden states
of an observed sequence outperforms a direct-assignment
Gibbs sampler. By extending the backward filtering forward
sampling procedure and making it applicable to the HDP-HLM,
we can obtain an inference procedure for the HDP-HLM.

The backward messages for super state $i$ in the HDP-HSMM
are calculated as follows.

\begin{align}
B_t(i) &= P(y_{t+1:T} \mid z_{s(t)} = i, F_t = 1) \tag{25} \\
       &= \sum_{j} B^*_t(j)\, P(z_{s(t+1)} = j \mid z_{s(t)} = i) \tag{26} \\
B^*_t(i) &= P(y_{t+1:T} \mid z_{s(t+1)} = i, F_t = 1) \tag{27} \\
       &= \sum_{d=1}^{T-t} B_{t+d}(i)\, P(D_{t+1} = d \mid z_{s(t+1)} = i) \nonumber \\
       &\quad \times P(y_{t+1:t+d} \mid z_{s(t+1)} = i, D_{t+1} = d) \tag{28} \\
B_T(i) &= 1 \tag{29}
\end{align}

where $F_t$ is a variable indicating that $t$ is the boundary of
a super state. If $F_t = 1$, then $z_{s(t)} \neq z_{s(t+1)}$. The
variable $B_t(i)$ in (25) represents the probability of the
observations after time step $t$ given that the latent super
state is $z_{s(t)} = i$ and that it transitions into a different
super state at the next time step. Probability $B_t(i)$ is
obtained by marginalizing over all super states $j$ at time step
$t+1$. Variable $B^*_t(j)$ in (27) represents the probability of
the same observations given that the latent super state becomes
$j$ from time step $t+1$. This probability is obtained by
marginalizing over the duration variable in (28). Probability
$P(y_{t+1:t+d} \mid z_{s(t+1)} = i, D_{t+1} = d)$ in (28) is the
emission probability of observed data $y_{t+1:t+d}$ given that
the duration $D_{t+1}$ of $z_{s(t+1)}$ is $d$. In the HDP-HSMM,
all time steps with the same super state $z$ share the same
emission distribution. Therefore, the likelihood of a super
state $z_{s(t+1)}$, i.e., $P(y_{t+1:t+d} \mid z_{s(t+1)}, D_{t+1} = d)$,
can be calculated easily.
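
The recursion in (25)-(29) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the list-based table layout, and the use of plain probabilities instead of log-probabilities are all assumptions made for readability (a practical sampler would work in the log domain to avoid underflow).

```python
def segment_likelihoods(frame_lik, d_max):
    """seg[i][t][d-1] = P(y_{t+1:t+d} | z = i), built as a running product.

    frame_lik[i][t] is the likelihood of the (t+1)-th frame under state i;
    all frames of one super state share the same emission distribution.
    """
    n_states, T = len(frame_lik), len(frame_lik[0])
    seg = [[[0.0] * d_max for _ in range(T)] for _ in range(n_states)]
    for i in range(n_states):
        for t in range(T):
            prod = 1.0
            for d in range(1, min(d_max, T - t) + 1):
                prod *= frame_lik[i][t + d - 1]
                seg[i][t][d - 1] = prod
    return seg


def backward_messages(trans, dur, seg_lik):
    """Return (B, Bstar) with B[t][i] = B_t(i) and Bstar[t][i] = B*_t(i).

    trans[i][j] = P(z_{s(t+1)} = j | z_{s(t)} = i)
    dur[i][d-1] = P(D = d | z = i), d = 1..d_max
    """
    n_states, d_max = len(dur), len(dur[0])
    T = len(seg_lik[0])
    B = [[0.0] * n_states for _ in range(T + 1)]
    Bstar = [[0.0] * n_states for _ in range(T + 1)]
    B[T] = [1.0] * n_states                              # (29)
    for t in range(T - 1, -1, -1):
        for i in range(n_states):                        # (28)
            Bstar[t][i] = sum(
                B[t + d][i] * dur[i][d - 1] * seg_lik[i][t][d - 1]
                for d in range(1, min(d_max, T - t) + 1))
        for i in range(n_states):                        # (26)
            B[t][i] = sum(trans[i][j] * Bstar[t][j] for j in range(n_states))
    return B, Bstar
```

Truncating the duration sum at a hypothetical `d_max` is a design choice of this sketch; it keeps the cost at O(T · d_max · N) for the duration sum per sequence.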

Notably, in the HDP-HLM, exactly the same procedure for
calculating backward messages as in the HDP-HSMM can
be used. We obtain a message passing algorithm for the
HDP-HLM by replacing a super state $z_s$ in the HDP-HSMM with
a latent word $z_s$ in the HDP-HLM. Only the likelihood of the
latent word, i.e., $P(y_{t+1:t+d} \mid z_{s(t+1)} = i, D_{t+1} = d)$,
differs between the two message passing algorithms. The
likelihood of the occurrence of latent word $z_{s(t+1)} = i$
then becomes

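\begin{align}
&P(y_{t+1:t+d} \mid z_{s(t+1)} = i, D_{t+1} = d) \nonumber \\
&\quad = \sum_{r \in R(L_i, d)} \prod_{k=1}^{L_i} \Bigl\{ P(r_k \mid \omega_{w_{ik}})
   \times \prod_{m=1}^{r_k} P\bigl(y_{t + \sum_{k' < k} r_{k'} + m} \mid \theta_{w_{ik}}\bigr) \Bigr\} \tag{30} \\
&R(L_i, d) = \Bigl\{ r \;\Big|\; |r| = L_i,\ \sum_{k=1}^{L_i} r_k = d \Bigr\} \tag{31}
\end{align}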

where $|x|$ indicates the number of elements in vector $x$,
and $r = (r_1, r_2, \ldots, r_{L_i})$ is an $L_i$-partition of
duration $d$. By substituting (30) into (28), we obtain a
formula to calculate the backward message of the HDP-HLM.
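
As a small worked illustration of the set of partitions, assuming that every letter lasts at least one frame (each $r_k \geq 1$), consider a word with $L_i = 2$ letters and a segment of duration $d = 4$:

```latex
R(2, 4) = \{(1,3),\ (2,2),\ (3,1)\}, \qquad
|R(L_i, d)| = \binom{d-1}{L_i-1} = \binom{3}{1} = 3 .
```

Under this assumption the number of terms in the sum grows combinatorially with $d$ and $L_i$, so direct enumeration quickly becomes impractical for long words.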

The calculation of (30) looks complicated at first glance.
However, the value of (30) can be calculated efficiently using
dynamic programming. If we define forward message $\alpha_t(k)$
as the probability that the $k$-th latent letter in the relevant
latent word $w_i$ transits to the next latent letter at time $t$
after emitting observations, forward message $\alpha_t(k)$ can be
calculated recursively as follows:

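\begin{align}
\alpha_t(k) &= \sum_{d'=1}^{t-k+1} \alpha_{t-d'}(k-1)\, P(d' \mid \omega_{w_{ik}})
               \prod_{t'=0}^{d'-1} P(y_{t-t'} \mid \theta_{w_{ik}}) \tag{32} \\
\alpha_0(0) &= 1. \tag{33}
\end{align}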

As a result, $P(y_{t+1:t+d} \mid z_{s(t+1)} = i, D_{t+1} = d) = \alpha_d(L_i)$.
By applying the calculation formula shown above, backward
messages $B_t(i)$ and $B^*_t(i)$ can be calculated. Using the
calculation procedure for backward messages, the forward
sampling procedure proposed for the HDP-HSMM can be employed.
The backward filtering forward sampling procedure enables the
blocked Gibbs sampler to directly sample latent words from
observation data without explicitly sampling latent letters in
the HDP-HLM.
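
The dynamic program in (32)-(33) can be checked against the explicit sum over partitions in (30) on a toy example. The following is only a sketch under stated assumptions: the function names, the list-based tables `dur` and `frame_lik`, and the convention that every letter lasts at least one frame are illustrative choices, not details taken from the authors' code.

```python
def word_likelihood_dp(word, dur, frame_lik):
    """alpha_d(L) from (32)-(33) for one segment of d frames.

    word            : list of letter indices, e.g. [0, 1, 0]
    dur[j][d-1]     : P(D = d | letter j)
    frame_lik[j][t] : likelihood of the (t+1)-th frame of the segment under letter j
    """
    L, d = len(word), len(frame_lik[0])
    alpha = [[0.0] * (L + 1) for _ in range(d + 1)]
    alpha[0][0] = 1.0                                    # (33)
    for k in range(1, L + 1):
        letter = word[k - 1]
        for t in range(k, d + 1):
            total, emit = 0.0, 1.0
            for dp in range(1, t - k + 2):               # d' = 1 .. t-k+1
                emit *= frame_lik[letter][t - dp]        # extend product backwards
                if dp <= len(dur[letter]):
                    total += alpha[t - dp][k - 1] * dur[letter][dp - 1] * emit
            alpha[t][k] = total
    return alpha[d][L]


def compositions(d, L):
    """Yield every r = (r_1, ..., r_L) with r_k >= 1 and sum(r) = d."""
    if L == 1:
        if d >= 1:
            yield (d,)
        return
    for first in range(1, d - L + 2):
        for rest in compositions(d - first, L - 1):
            yield (first,) + rest


def word_likelihood_brute(word, dur, frame_lik):
    """Direct evaluation of the sum over partitions, for verification only."""
    d, total = len(frame_lik[0]), 0.0
    for r in compositions(d, len(word)):
        p, start = 1.0, 0
        for letter, r_k in zip(word, r):
            p *= dur[letter][r_k - 1] if r_k <= len(dur[letter]) else 0.0
            for t in range(start, start + r_k):
                p *= frame_lik[letter][t]
            start += r_k
        total += p
    return total
```

The brute-force routine exists only to validate the recursion on short segments; the dynamic program is the one whose cost stays polynomial in the segment length.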


In the forward sampling procedure, super state $z_s$ and its
duration $D_s$ are sampled iteratively using the backward
messages as follows.

\begin{align}
&P(z_s = i \mid y_{1:T}, z_{s-1} = j, F_{D_s^{sum}} = 1) \nonumber \\
&\quad = P(z_s = i \mid z_{s-1} = j)\, B_{D_s^{sum}}(i)\, P(y_{D_s^{sum}} \mid z_s = i) \tag{34} \\
&P(D_s = d \mid y_{1:T}, z_s = i, F_{D_s^{sum}} = 1) \nonumber \\
&\quad = \frac{P(D_s = d)\, P(y_{D_s^{sum}+1 : D_s^{sum}+d} \mid D_s = d, z_s = i, F_{D_s^{sum}} = 1)\, B_{D_s^{sum}+d}(i)}
              {B^*_{D_s^{sum}}(i)} \tag{35}
\end{align}

where $D_s^{sum} = \sum_{s' < s} D_{s'}$. For further details,
please refer to the original paper in which the HDP-HSMM was
introduced [6].


B. Sampling a letter sequence for a latent word: $w_i$


The sampled $z_s$ is only an index of a latent word. Concrete
letter sequences $w_i$ for each latent word $i$ should be sampled
according to the correspondence of each sub-sequence of
time series data $y^k = (y^k_1, y^k_2, \ldots)$ to each latent word.
When a latent word $z_s$ is given, the generative model of
the observations in the range of the latent word $z_s$ can be
regarded as an HDP-HSMM whose super states correspond
to latent letters. Therefore, in the proposed model, each
sub-sequence of observation data corresponding to a latent word
can be considered an observed sequence generated by an
HDP-HSMM. If only a single sub-sequence of observations
corresponded to a latent word, a latent letter sequence could be
sampled using the ordinary sampling procedure of the HDP-HSMM.
However, observations containing the same latent
word have to share the same latent letter sequence $w$.
Therefore, latent letter sequences for observations with the same
latent word are sampled simultaneously, under the constraint
that they share the same latent letter sequence. We employ an
approximate sampling procedure based on sampling importance
resampling (SIR) [46].

If we define the observations sharing the same latent word
as $y^{1:k} = \{y^1, y^2, \ldots, y^k\}$ and the shared latent letter
sequence as $w$, the posterior probability $P(w \mid y^{1:k})$ becomes

\begin{align}
P(w \mid y^{1:k}) &\propto P(w)\, P(y^{1:k} \mid w) \tag{36} \\
 &= \underbrace{P(w \mid y^j)}_{\text{sampling}}\;
    \underbrace{P(y^j) \prod_{i \neq j} P(y^i \mid w)}_{\text{weight}} \tag{37}
\end{align}


where $P(y^j)$ in (37), representing the likelihood of the
observation, can be calculated using the backward filtering
procedure in the HDP-HSMM. Probability $P(y^i \mid w)$ can also
be calculated in the same way as (30) if $w$ is given. The
HDP-HSMM also provides a sampling procedure for $P(w \mid y^j)$.
Therefore, if we consider $P(w \mid y^j)$ as the proposal
distribution and $P(y^j) \prod_{i \neq j} P(y^i \mid w)$ as a weight,
the SIR procedure can be employed [46]. Specifically, after a set
of $w$ is sampled from the proposal distributions $P(w \mid y^j)$,
$j = 1,2,\ldots,k$, a final sample is drawn from the set with a
probability proportional to each sample’s weight. Using this
procedure, the proposed model can approximately sample a latent
letter sequence $w_i$ for the $i$-th latent word.
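
The SIR step can be illustrated on a toy problem in which the candidate letter sequences can be enumerated. This is a sketch under strong assumptions: in the actual model the proposal $P(w \mid y^j)$ is sampled with the HDP-HSMM machinery rather than from an explicit table, and the function name, the `prior` vector, and the likelihood table `lik` are hypothetical stand-ins introduced only for illustration.

```python
import random


def sir_shared_sequence(candidates, prior, lik, rng):
    """Approximately sample one letter sequence shared by k observations.

    candidates : list of candidate letter sequences (toy, enumerable set)
    prior[c]   : P(w = candidates[c])
    lik[j][c]  : P(y^j | w = candidates[c]) for observation j
    """
    k, n = len(lik), len(candidates)
    samples, weights = [], []
    for j in range(k):
        # Proposal P(w | y^j) is proportional to P(w) P(y^j | w);
        # its normaliser is the evidence P(y^j).
        joint = [prior[c] * lik[j][c] for c in range(n)]
        evidence = sum(joint)
        c = rng.choices(range(n), weights=joint)[0]
        # Importance weight: P(y^j) times the product over i != j of P(y^i | w).
        weight = evidence
        for i in range(k):
            if i != j:
                weight *= lik[i][c]
        samples.append(c)
        weights.append(weight)
    # Resampling step: pick one proposal with probability proportional to its weight.
    chosen = rng.choices(samples, weights=weights)[0]
    return candidates[chosen]
```

With one proposal per observation the result is only an approximation of a draw from $P(w \mid y^{1:k})$, which matches the approximate nature of the procedure described above.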


C. Sampling model parameters 

After sampling latent words $\{z_s\}$ for each observation
and sampling letter sequences for the latent words, the other
parameters can be updated. Parameters of the language model,
i.e., $\{\pi_i^{LM}\}$ and $\beta^{LM}$, can be updated on the basis
of the latent word sequences. Parameters of the word model, i.e.,
$\{\pi_j^{WM}\}$ and $\beta^{WM}$, can be updated on the basis of the
sampled letter sequences for latent words. Parameters of the
acoustic model, i.e., $\{\omega_j\}$ and $\{\theta_j\}$, can be updated
once each hidden state $x_t$ is determined for each $y_t$. During
the SIR process for sampling a letter sequence, tentative letter
sequences $\{w_s^m\}$ in Algorithm 1 are obtained as a by-product.
To accelerate mixing, these by-product samples $\{w_s^m\}$ are
used for updating the acoustic model parameters. These
parameters can be sampled in the same way as in the HDP-HSMM.
For more details, we refer to the original paper in which the
HDP-HSMM was introduced [6]. Finally, the overall sampling
procedure is obtained, as described in Algorithm 1.

D. NPB-DAA 

Based on the generative model, i.e., the HDP-HLM, and its
inference algorithm shown in Algorithm 1, the proposed NPB-DAA
is finally obtained. By assuming the HDP-HLM as a generative
model of observed time series data, and by inferring the latent
variables of the model, we can analyze the latent double
articulation structure, i.e., hierarchically organized latent
words and phonemes, of the data in an unsupervised manner. We
call this novel unsupervised double articulation analyzer
NPB-DAA.

V. Experiment 1: Synthetic data 

We conducted an experiment using a synthetic dataset that 
explicitly has a double articulation structure to validate our 
proposed method. 

A. Conditions 

To validate the ability of our proposed method to infer
a latent double articulation structure in time series data, we
applied the proposed NPB-DAA based on the HDP-HLM to
synthetic time series data. The conventional DAA was employed
as a comparative method. The time series data were generated
using five letters $\{j\}_{j \in J} = \{1,2,3,4,5\}$ and four words
$\{w\}_{w \in W} = \{[1,3,5], [3,2], [4,1,5,2], [1,5]\}$, where $J$ is a
set of letters and $W$ is a set of words. The four words were
generated randomly. The sequence $w_i = [w_{i1}, w_{i2}, \ldots, w_{iL_i}]$
represents a word that is generated by combining
$\{w_{i1}, w_{i2}, \ldots, w_{iL_i}\}$ sequentially, where $w_{ik}$ denotes
the $k$-th letter of $w_i$. The durations of the letters were
assumed to follow Poisson distributions, and their parameters
were drawn from a Gamma distribution whose parameters were
$\alpha = 50$, $\beta = 10$. The emission distribution
was assumed to be a Gaussian distribution whose parameters
were $\mu = 5i$, $\sigma^2 \in \{0.1, 0.5, 1.0\}$, where $i$ represents
the index of latent letters. The variance of the emission
distribution was changed in stages, and the inference results
were compared. Forty time series data items were generated from
20 types of latent word sequences. Sixteen of them were pairs of
words in $W$, e.g., ([1,3,5], [1,5]) and ([3,2], [3,2]). Four of
Algorithm 1 Blocked Gibbs sampler for HDP-HLM

  Initialize all parameters.
  Observe M time series data $\{y^m_{1:T_m}\}_{m \in \{1,2,\ldots,M\}}$.
  repeat
    for m = 1 to M do
      // Backward filtering procedure
      For each $i \in \{1,2,\ldots,N\}$, initialize messages $B_T(i) = 1$.
      for t = T to 1 do
        For each $i \in \{1,2,\ldots,N\}$, compute backward messages
        $B_t(i)$ and $B^*_t(i)$ using (25)-(28).
      end for
      // Forward sampling procedure
      Initialize $s = 1$ and $D_s^{sum} = 0$
      while $D_s^{sum} < T_m$ do
        // Sampling a super state representing a latent word
        $z_s \sim P(z_s \mid y^m_{1:T_m}, z_{s-1}, F_{D_s^{sum}} = 1)$
        // Sampling duration of the super state
        $D_s \sim P(D_s \mid z_s, F_{D_s^{sum}} = 1)$
        $D_{s+1}^{sum} \leftarrow D_s^{sum} + D_s$
        $s \leftarrow s + 1$
      end while
      // Sampling tentative latent letter sequences
      for s = 1 to $S^m$ do
        $w_s^m \sim P(w \mid y^m_{D_s^{sum}+1 : D_s^{sum}+D_s})$
      end for
    end for
    // Update model parameters
    Sample acoustic model parameters $\{\omega_j, \theta_j\}$ on the basis
    of tentatively sampled latent letter sequences $\{w_s^m\}$.
    Sample language model parameters $\{\pi_i^{LM}\}$, $\beta^{LM}$ on the
    basis of sampled super states $\{z_s\}$, i.e., latent words.
    Sample a word inventory $\{w_i\}_{i=1,2,\ldots,N}$ using the SIR
    procedure (see Section IV-B).
    Sample a word model $\{\pi_j^{WM}\}$, $\beta^{WM}$ on the basis of
    the sampled word inventory.
  until a predetermined exit condition is satisfied.


them were three-word sentences, e.g., ([3,2], [1,3,5], [1,5]). A
sequence of latent words is represented by $(w_1, w_2, \ldots, w_n)$.
Two observations were generated from each word sequence.
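
The generative process for the synthetic data can be sketched as follows. Several details are assumptions of this sketch rather than facts from the experiment: the Gamma parameters are read as shape 50 and rate 10 (mean 5 frames), durations are clipped to at least one frame, and the letter mean is taken proportional to the letter index through a hypothetical `mean_scale` parameter.

```python
import math
import random

# The four words of the synthetic vocabulary, as letter-index sequences.
WORDS = [[1, 3, 5], [3, 2], [4, 1, 5, 2], [1, 5]]


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small rates such as ~5."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def generate(word_seq, rates, sigma2, rng, mean_scale=5.0):
    """Generate one observation sequence and its frame-level letter labels."""
    ys, letters = [], []
    for word in word_seq:
        for letter in word:
            # Assumption: every letter is emitted for at least one frame.
            dur = max(1, sample_poisson(rates[letter], rng))
            for _ in range(dur):
                ys.append(rng.gauss(mean_scale * letter, math.sqrt(sigma2)))
                letters.append(letter)
    return ys, letters


rng = random.Random(0)
# random.gammavariate takes (shape, scale); rate 10 corresponds to scale 0.1.
rates = {j: rng.gammavariate(50.0, 1.0 / 10.0) for j in range(1, 6)}
ys, letters = generate([WORDS[1], WORDS[0], WORDS[3]], rates, 1.0, rng)
```

The last line generates one observation for the three-word sequence ([3,2], [1,3,5], [1,5]); calling `generate` twice per word sequence would mirror the two observations per sequence described above.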

We set the parameters of the NPB-DAA as follows: the
hyperparameters for the latent language model were
$\gamma^{LM} = 10.0$ and $\alpha^{LM} = 10.0$, and the maximum number of
words was six for the weak-limit approximation. The
hyperparameters for the latent word model were $\gamma^{WM} = 10.0$
and $\alpha^{WM} = 10.0$, and the maximum number of letters was seven
for the weak-limit approximation. The hyperparameters of the
duration distributions were set to $\alpha = 50$ and $\beta = 10$, and
those of the emission distributions were set to $\mu_0 = 0$,
$\sigma_0^2 = 1.0$, $\kappa_0 = 0.01$, and $\nu_0 = 1$.
The Gibbs sampling procedure was iterated 100 times.

For the conventional DAA, we set the hyperparameters of
the sticky HDP-HMM to be as similar to those of the NPB-DAA
as possible. In this condition, the latticelm software¹
developed by Neubig et al. was used for the NPYLM. The
hyperparameters of the NPYLM used in the conventional DAA
were set to a = 0.1 and d = 0.1.

The hyperparameters in the NPB-DAA were heuristically 
given in a top-down manner by referring to the size of the state 
space and the approximate duration of a phoneme. Those of 
the Pitman-Yor language model were set to the default values 
of the software. 


B. Results 

The average log-likelihood is shown in Fig. 4, where error
bars represent the standard deviation of 30 trials. These results
show that the proposed inference procedure worked appropriately,
gradually sampling more probable latent variables as the
iterations increased.

In contrast with ordinary speech recognition tasks, the target
task (language acquisition and double articulation analysis) is
an unsupervised learning task. Specifically, it is a clustering
task. Therefore, it is difficult to evaluate the methods’
performance from the viewpoint of precision and recall because
the estimated index of a cluster and the label corresponding to
the ground truth data are usually different. We evaluated the
obtained result using the adjusted Rand index (ARI), which
quantifies the performance of a clustering task [47]. If all
data items are clustered randomly or assigned to only one
cluster, the ARI becomes 0. By contrast, if the results of
clustering are the same as those of the ground truth data, the
ARI becomes 1.
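
To make the evaluation measure concrete, the following sketch computes the ARI from two label sequences using the standard pair-counting formula. The function name and the convention of returning 1.0 in the degenerate case are choices of this sketch; the tooling actually used for the reported values is not stated in the text.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI; invariant to how cluster indices are named."""
    n = len(labels_true)
    # Contingency counts n_ij and the marginal cluster sizes a_i, b_j.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        # Degenerate case (e.g. both partitions are a single cluster).
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

For example, the ground truth `[0, 0, 1, 1, 2, 2]` and the estimate `[5, 5, 3, 3, 9, 9]` use different indices but describe the same partition, so the ARI is 1, whereas an estimate that puts every frame into a single cluster gives an ARI of 0.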

Table I shows the ARI for the estimated latent letters.
The ARI for estimated latent letters shows how accurately
each method estimated latent letters, which correspond to
phonemes in speech signals. Table II shows the ARI for
estimated latent words. The ARI for estimated latent words
shows how accurately each method estimated latent words,
which correspond to words in speech signals. In both tables,
each column shows ARIs for a different $\sigma^2$. A higher ARI
implies more accurate estimation of the latent variables.

Although the ARI for the latent letters obtained by the
conventional DAA decreased as the variance $\sigma^2$ increased, that
of the NPB-DAA did not decrease as much. As the ARIs for
latent words show, the performance of word segmentation by
the conventional DAA was poor, even when the ARI for latent
letters was larger than 0.8. In contrast, the ARI for latent
words estimated by the NPB-DAA was over 0.5 in all conditions.
This shows that the NPB-DAA can mitigate the ill effects of
phoneme recognition errors in the word segmentation task, and
that the obtained knowledge about words can improve phoneme
recognition performance by using contextual information. Fig. 5
shows the change in ARI through iterations in the case of
$\sigma^2 = 1.0$. The ARI increased gradually as the log-likelihood
increased, as in Fig. 4. These results suggest that the NPB-DAA
is based on an appropriate generative model because better word
segmentation performance corresponded to a higher likelihood of
the model.

1 latticelm: http://www.phontron.com/latticelm/index.html 
Fig. 4. Log-likelihood profile through Gibbs sampling ($\sigma^2 = 1.0$)



To check the effects of the limit in the weak-limit
approximation, we ran an experiment in which the maximum number
of letters was 20 for the weak-limit approximation. The ARIs for
the estimated latent words were {0.682, 0.650, 0.604}, those
for the estimated latent letters were {0.967, 0.899, 0.878}, and
the estimated numbers of latent letters were {5.6, 6.3, 6.6} on
average for $\sigma^2 = \{0.1, 0.5, 1.0\}$. This result shows that our
model can appropriately estimate the number of latent
states, owing to the nature of Bayesian nonparametrics, when
the limit is sufficiently large.

An example of estimated latent variables is shown in Fig. 6,
which shows the results for time series data generated from
the latent word sequence ([3,2], [1,3,5], [1,5]). The input time
series data are shown at the very top of the figure. The top of
each panel shows the true latent letters or latent words, whereas
the panel beneath shows the inferred results. The vertical axes
represent the iteration of the Gibbs sampling. In Fig. 6, the
panel in the lower middle shows a latent word sequence estimated
using the proposed method, and the panel at the bottom shows
the estimated boundaries of the latent words. These results
show that the inference procedure works consistently and can
estimate adequate boundaries for the latent words given the
data.




Fig. 6. Example of inference results for sample data ([3,2], [1,3,5], [1,5]) and
$\sigma^2 = 1.0$: (top) observation data, (upper middle) latent letters, (lower middle)
latent words, and (bottom) the boundaries of latent words. Different colors
denote different states.
TABLE I
ARI FOR ESTIMATED LATENT LETTERS

$\sigma^2$                            0.1     0.5     1.0
Conventional DAA (sticky HDP-HMM)     0.845   0.832   0.649
NPB-DAA                               0.984   0.895   0.938


TABLE II
ARI FOR ESTIMATED LATENT WORDS

$\sigma^2$                                    0.1     0.5     1.0
Conventional DAA (sticky HDP-HMM + NPYLM)     0.122   0.107   0.125
NPB-DAA                                       0.594   0.509   0.618



These results show that the proposed method is a more 
effective machine learning method for estimating a latent 
double articulation structure embedded in time series data. 

VI. Experiment 2: Continuous Japanese Vowel 
Speech Signal 

In the second experiment, we evaluated our proposed
method using Japanese vowel speech signals to test the
applicability of the proposed method to actual human continuous
speech signals.

A. Conditions 

We prepared four datasets. Each dataset corresponds to a
speaker and consisted of 60 audio data items. We asked two
male and two female Japanese speakers to read 30 artificial
sentences aloud two times at a natural speed, and recorded
their voices. The 30 sentences were prepared using five words
{aioi, aue, ao, ie, uo}, which consisted of the five Japanese
vowels {a, i, u, e, o}, representing {a, i, ɯᵝ, e, o} in phonetic
symbols, respectively. By reordering the five words, we prepared
25 two-word sentences, e.g., “ao aioi,” “uo aue,” and “aioi aioi,”
and five three-word sentences, i.e., “uo aue ie,” “ie ie uo,”
“aue ao ie,” “ao ie ao,” and “aioi uo ie.” The set of two-word
sentences consisted of all types of word pairs (5 × 5 = 25).
The set of three-word sentences was generated randomly.

The recorded data were encoded into 13-dimensional
mel-frequency cepstrum coefficient (MFCC) time series data using
the HMM Toolkit (HTK)². The frame size and shift were set
to 25 and 10 ms, respectively. Twelve-dimensional MFCC data
were obtained as input data by eliminating the power information
from the original 13-dimensional MFCC data. As a result,
12-dimensional time series data at a frame rate of 100 Hz were
obtained.
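
The relationship between the frame shift and the resulting frame rate can be illustrated with a short sketch. Everything specific here is an assumption for illustration: the 16 kHz sampling rate is hypothetical (the recording rate is not stated), the frame-counting rule is one common convention rather than a documented HTK behavior, and the position of the power coefficient in the 13-dimensional vector is assumed to be the last column.

```python
def frame_count(n_samples, sample_rate_hz, frame_ms=25.0, shift_ms=10.0):
    """Number of full analysis frames under a sliding-window convention."""
    win = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop = int(round(sample_rate_hz * shift_ms / 1000.0))
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


def drop_coefficient(frames, index):
    """Remove one column (here: the assumed power term) from every frame."""
    return [row[:index] + row[index + 1:] for row in frames]


# One second of audio at a hypothetical 16 kHz sampling rate.
n_frames = frame_count(16000, 16000)
# Dummy 13-dimensional features reduced to 12 dimensions.
features = drop_coefficient([[0.0] * 13 for _ in range(n_frames)], index=12)
```

A 10 ms shift yields roughly 100 frames per second; the count falls slightly short of 100 for a one-second signal only because the last partial windows are discarded in this convention.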

The hyperparameters for the latent language model were
set to $\gamma^{LM} = 10.0$ and $\alpha^{LM} = 10.0$, and the maximum number
of words was set to seven for the weak-limit approximation. The
hyperparameters for the latent word model were $\gamma^{WM} = 10.0$
and $\alpha^{WM} = 10.0$, and the maximum number of letters was
seven for the weak-limit approximation. The hyperparameters of
the duration distributions were set to $\alpha = 200$ and $\beta = 10$, and
those of the emission distributions were set to $\mu_0 = 0$,
$\sigma_0^2 = 1.0$, $\kappa_0 = 0.01$, and $\nu_0 = 17$ (= dimension + 5).

2 Hidden Markov Model Toolkit: http://htk.eng.cam.ac.uk/

For the conventional DAA, we set the hyperparameters of 
the sticky HDP-HMM to be as similar to those of the NPB- 
DAA as possible. The hyperparameters for the NPYLM used 
in the conventional DAA were set to a = 0.1 and d = 0.1. 
The Gibbs sampling procedure was iterated 100 times. With 
different random number seeds, 20 trials were performed. 

The parameters in the NPB-DAA were given heuristically in a
top-down manner by referring to the size of the state space
and the approximate duration of a phoneme. Those of the
Pitman-Yor language model were set to the default values of
the software.

As a baseline method, we employed an open-source continuous
speech recognition engine, Julius³, which is widely used
in Japanese speech recognition tasks. Julius’s acoustic model is
trained on a large amount of speech data in a supervised
manner. We prepared four conditions for Julius. The first one
was called “Julius (phoneme + NPYLM).” In this condition,
we used Julius as a phoneme recognition system by preparing
a phoneme dictionary containing the five Japanese vowels {a, i, u,
e, o}. Julius’s dictionary also contains silB and silE
to represent silence, owing to system requirements. After
encoding continuous speech signals into phoneme sequences using
Julius as a phoneme recognizer, unsupervised morphological
analysis based on the NPYLM was conducted to discover
words and a language model. The second condition was called
“Julius (phoneme + latticelm).” In this condition, we instead used
the latticelm software, an unsupervised morphological analyzer
for lattice output from an ASR system, which was proposed
by Neubig et al. as an extension of Mochihashi’s NPYLM [24].

In the third and fourth conditions, called “Julius (monophone
+ word dictionary)” and “Julius (triphone + word dictionary),”
respectively, we prepared a complete word dictionary
that contained all of the words that appeared in the target
speech signals, i.e., {aioi, aue, ao, ie, uo}, for Julius. These
conditions provide almost an upper bound for the performance
on our task. Except in “Julius (triphone + word dictionary),”
Julius used the monophone-based acoustic model contained in
the dictation kit. The acoustic model is trained in a supervised
manner using a large amount of labeled speech data. “Julius
(triphone + word dictionary)” used a triphone-based acoustic
model for comparison.

B. Results 

We provided word and letter ground truth labels to all 
frames of the speech signal data and evaluated the relationship 
between the truth labels and estimated latent letter and word 
indices. 

The results are shown in Table III. Check marks in the AM
and LM columns indicate that the method used a pretrained
acoustic model (AM) and the given true language model (LM),

3 Open-Source Large Vocabulary CSR Engine Julius:
http://julius.sourceforge.jp/ The Linux binary dictation-kit-v4.3.1-linux.tgz
was used in this experiment. The software encodes the recorded data into
36-dimensional MFCC data including dynamic features and uses them for
speech recognition.
TABLE III
ARI FOR ESTIMATED LATENT LETTERS AND WORDS

Method                                    Letter ARI  Word ARI  AM  LM
NPB-DAA (MAP)                             0.596       0.529
NPB-DAA                                   0.561       0.401
Conventional DAA                          0.590       0.090
Julius (phoneme dictionary + NPYLM)       0.486       0.297     ✓
Julius (phoneme dictionary + latticelm)   0.554       0.337     ✓
Julius (monophone + word dictionary)      0.586       0.487     ✓   ✓
Julius (triphone + word dictionary)       0.548       0.616     ✓   ✓



respectively. Letter ARI shows the ARI of phoneme clustering.
A higher Letter ARI means more accurate phoneme acquisition
and recognition. Word ARI shows the ARI of word clustering.
A higher Word ARI means more accurate word discovery
and recognition. Each row corresponds to a method explained
in the conditions. The results of “NPB-DAA” and
“Conventional DAA” show the ARI averaged over 20 trials.
In contrast, “NPB-DAA (MAP)” shows the result of the trial with
the maximum a posteriori probability (MAP) among the 20 trials.
An advantage of the NPB-DAA is that the method can calculate
the posterior probability of a given dataset after the learning
phase because the NPB-DAA is derived from a generative model,
i.e., the HDP-HLM, which integrates the language and acoustic
models. In contrast with the conventional DAA and similar
methods that do not have appropriate generative models, the
NPB-DAA can select an appropriate learning result by referring
to this probability. The row labeled MAP in Table III shows that
this probability is an adequate criterion for selecting a
learning result.

The results show that “NPB-DAA (MAP)” outperformed
not only the conventional DAA but also the Julius-based word
discovery systems whose acoustic models were trained in a
supervised manner. One reason is that the acoustic models
of the DAAs were trained only on one participant’s speech
signals, whereas Julius’s acoustic model was trained on
the speech signals of many speakers. In other words, the
NPB-DAA acquired a speaker-dependent acoustic model, whereas
Julius used a speaker-independent acoustic model.
This adaptation of the acoustic model to the speaker presumably
increased the NPB-DAA’s performance.

The results show that a naive application of the NPYLM
to recognized phoneme sequences results in poor word
acquisition performance, especially in the conventional DAA.
Because the theory of the NPYLM does not presume that letter
sequences contain recognition errors, the existence of phoneme
recognition errors deteriorates word segmentation performance.
The methods that simply apply an NPYLM to the obtained
phoneme sequences, i.e., the conventional DAA and Julius
(phoneme dictionary + NPYLM), yielded a much lower word ARI
than letter ARI. However, latticelm, which presumes phoneme
recognition errors to some extent, could not dramatically
improve the performance of word acquisition in our experimental
setting.

In contrast, “Julius (triphone + word dictionary)” improved
its word ARI performance relative to its letter ARI performance.
“Julius (monophone + word dictionary)” also kept its
performance high on the word recognition task
compared with the phoneme recognition task. We note that
the word error rate was 32.8% and the phoneme error rate
was 28.1% in Julius (monophone + word dictionary).

In the research field of ASR, it is widely known that a good
language model improves word and phoneme recognition
performance. The NPB-DAA could not improve the word ARI
relative to the letter ARI. However, it obtained an adequate
language model and prevented the word ARI from becoming far
worse than the letter ARI. To achieve such error-tolerant word
acquisition, the direct inference of latent words is important
in the NPB-DAA. In the inference procedure described in Section
IV, latent words are sampled directly, without sampling latent
letters, while marginalizing over all possible latent letter
sequences. This achieves an effect similar to that of a given
language model in the inference process.

Typical examples of the estimation results are shown in
Table IV for the NPB-DAA and the conventional DAA. Each number
in parentheses represents an estimated phoneme label, each
space represents a phoneme boundary, each number in bold
style represents a sampled index of a word, and “/” represents
a boundary between successive words. For example, “ao ie”
was divided into two words, i.e., “5 0 1” and “6 3 4 6,” in
the NPB-DAA results, and their word indices were 3 and 4.
In Table IV, the sampled letters corresponding to the word
“ie” are underlined. Although the conventional DAA could not
estimate “ie” as a single word, the NPB-DAA could estimate
“ie” to be a single word: “4.” In the conventional DAA results,
several phoneme recognition errors can be found. The errors
severely degraded the subsequent chunking process, i.e.,
unsupervised morphological analysis using an NPYLM, as past
research has frequently pointed out. As shown in Table IV, the
NPB-DAA also had some phoneme recognition errors. However, in
the NPB-DAA, latent words are sampled on the basis of the
marginalized phoneme distribution before sampling concrete
phoneme sequences. This property of the sampling procedure
seemed to improve the performance of the NPB-DAA.

An example of the estimated latent variables is shown in
Fig. 7, which shows the results for time series data
corresponding to a vowel sequence, “ao ie ao.” The input time
series data, i.e., 12-dimensional MFCC time series data, are
shown at the top of the figure. The middle and the bottom panels
show the inference process. The top of each panel shows the true
latent letters or latent words, whereas the bottom shows the
inferred result. The vertical axes represent the number of Gibbs
sampling iterations. This shows that the inference procedure
worked for human vowel sequence data and could estimate
an adequate unit for each word.

Let us further examine the characteristics of the segmentation
results of the NPB-DAA. Table IV shows that some of the
estimated latent words have a latent letter “6” at their head or
tail. The latent letter “6” represents silence observed during the
transition from one vowel to another. Silence in speech signals
TABLE IV
Example word discovery results

Vowel sequence  Estimated NPB-DAA results                          Estimated conventional DAA results
ao ie           3 (5 0 1) / 4 (6 3 4 6)                            226 (2 0 3 4 1 5 4 1)
ao ie ao        3 (5 0 1) / 4 (6 3 4 6) / 3 (5 0 1) / 0 (6 4 6)    494 (3) / 675 (2 3 0) / 374 (1 5 4 1 2 0 1)
aue ie          6 (6 5 1 2 6 4) / 4 (6 3 4 6)                      329 (2 3 8 4 5 4 1)
ie ie           4 (6 3 4 6) / 4 (6 3 4 6)                          389 (5 4 1 4 1 5 4 1)
ie uo           4 (6 3 4 6) / 5 (5 1 2) / 3 (5 0 1)                401 (5 4 1 8 0 1)
ie aioi         4 (6 3 4 6) / 1 (5 6 4 6 3 6 1) / 4 (6 3 4 6)      813 (5 4 1 2 4 5) / 832 (4 3 0 3 4 5 1)


and the transitional sounds observed between two phonemes 
were treated in the same manner as other uttered sounds in our 
model. The question of whether such signals should be treated 
in the same way as other sounds in a generative model calls 
for further investigation. In our model, a phoneme is simply 
represented by a single Gaussian distribution, although many 
past speech recognition systems assign a richer structure to 
a phoneme, e.g., a three-state left-to-right HMM with GMM 
emission distributions. There is room for investigating whether 
a phoneme model, i.e., a latent letter, should itself have a 
more complex structure, or if a double articulation hierarchy is 
sufficient from the viewpoint of unsupervised word discovery 
tasks. 

An interesting result that represents a characteristic of the 
NPB-DAA is the latent word “4 (6 3 4 6)” estimated at the end 
of “ie aioi.” The speech signals corresponding to this “4” were 
a kind of transitional sound observed following “aioi.” The 
NPB-DAA directly inferred the latent word by marginalizing 
latent letters. In this case, it seems that “4” was more likely 
than other latent words, and the NPB-DAA hence generated 
this result. This can be regarded as a side effect of our 
approach, i.e., the marginalization of latent letter sequences 
in a latent word. We are confident that the marginalization of 
latent letters and the direct inference of word sequences are 
important to improving the performance of the unsupervised 
word segmentation of continuous speech signals, but there is 
room to consider this side effect. 

Note that the NPB-DAA performed unsupervised word
discovery under the condition that the training data consisted
of speech signals uttered by one speaker, in contrast with
Julius, whose acoustic model was trained using many speakers’
speech signals. Speaker-independent, unsupervised word
discovery from continuous speech signals remains a challenging
problem because the acoustic features of phonemes heavily
depend on the speaker. When we gave four speakers’ speech
signals to the NPB-DAA at the same time, the Letter ARI and
the Word ARI decreased to 0.297 and 0.104, respectively. By
contrast, those produced by Julius with a triphone acoustic
model and a true word dictionary were 0.552 and 0.599,
respectively. In this experiment, 120 audio data items that
were recorded by asking two male and two female Japanese
speakers to read 30 artificial sentences were used, i.e.,
half of the data items used in the main experiment, owing to
computational cost. It was observed that speaker-dependent
phoneme models were obtained by the NPB-DAA, i.e., speech
signals representing the same phoneme uttered by different
persons tended to be clustered into different latent letters.
Developing a machine learning method that enables a robot to
obtain language and acoustic models that are independent of
speakers, or that automatically adapt to different speakers, is
one of our future challenges.

VII. Conclusion 

In this paper, we proposed NPB-DAA for direct and 
simultaneous acquisition of language and acoustic models from 
continuous speech signals in an unsupervised manner. For 
this purpose, we proposed an integrative generative model 
called the HDP-HLM by extending HDP-HSMM. Based on 
the generative model, we derived an inference procedure by 
extending the blocked Gibbs sampler originally proposed for 
HDP-HSMM. The method is expected to enable a developmental 
robot to simultaneously obtain language and acoustic 
models directly from continuous speech signals. To evaluate 
the performance of the proposed method, two experiments 
were performed. In the first experiment, the proposed method 
was applied to synthetic data, and it was shown that the method 
can successfully infer latent words embedded in time series 
data in an unsupervised manner. In the second experiment, 
we applied the proposed method to actual human Japanese 
vowel sequences. The result showed that the proposed method 
outperformed a conventional two-stage sequential method, 
conventional DAA, and a baseline ASR method. 

One of the most important challenges in our future work is 
to achieve complete human language acquisition from speech 
signals. In this study, we did not achieve complete language 
acquisition from speech signals that include consonants as well 
as vowels. Language acquisition from more natural speech 
signals, such as child-directed speech by human parents, is also 
part of our future work. To achieve these aims, we still have 
two main problems: feature extraction and computational cost. 

To address the first problem, more sophisticated feature 
extraction methods are needed. Deep learning has gained 
attention recently because of its impressive feature extraction 
performance. Integrating a deep learning method into the 
NPB-DAA should improve its performance. 

Computational cost is another problem. Even though the 
size of the dataset used in Experiment 2 was very small, 
it took approximately 240 minutes for 100 iterations using an 
Intel Xeon CPU E5-2650 v2 2.60 GHz, 8 cores x 16 CPU. 
In particular, the computational cost of the blocked Gibbs 
sampler was $O(T L_{\max} d_{\max}^{2} N_{\max}^{2})$, where $L_{\max}$ is the maximum 
number of latent letters for a word, $d_{\max}$ is the maximum 
duration of a word, and $N_{\max}$ is the maximum number of words. 

Fig. 7. Example of inference results for “ao ie ao.” MFCC feature vectors are 
plotted in the top panel. The middle and bottom panels show the inference 
results of latent letters and latent words, respectively. Different colors denote 
different states. 


To apply the proposed method to a larger dataset, improving 
its computational cost will be necessary. 
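
As a rough illustration of why the cost becomes a bottleneck for larger datasets, the sketch below evaluates how a bound of this form scales when one quantity grows. The function names and the numeric values are hypothetical, and the exponents on d_max and N_max are kept as parameters (defaulting to 2, which is an assumption here); only the linear dependence on T and L_max is used without qualification.

```python
def gibbs_cost_bound(T, L_max, d_max, N_max, d_exp=2, n_exp=2):
    """Unitless operation count T * L_max * d_max**d_exp * N_max**n_exp.

    d_exp and n_exp are assumed exponents, exposed so they can be changed.
    """
    return T * L_max * d_max ** d_exp * N_max ** n_exp


def scaling_factor(base, scaled, **exponents):
    """Ratio of the bound for `scaled` settings to the bound for `base` settings."""
    return gibbs_cost_bound(**scaled, **exponents) / gibbs_cost_bound(**base, **exponents)


# Illustrative settings only; these are not the values used in the experiments.
base = dict(T=1000, L_max=3, d_max=20, N_max=10)
longer_data = dict(base, T=2000)        # twice as many frames
longer_words = dict(base, d_max=40)     # twice the maximum word duration

print(scaling_factor(base, longer_data))   # linear growth in T
print(scaling_factor(base, longer_words))  # growth depends on the assumed exponent

# Reported wall-clock time: about 240 minutes for 100 iterations.
minutes_per_iteration = 240 / 100
print(minutes_per_iteration)
```

Under these assumptions, doubling the amount of data doubles the per-iteration cost, while doubling the maximum word duration or the maximum number of words costs considerably more.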

Currently, the accuracy of the language acquisition is still 
limited, as shown in Table III. In this paper, we focused on 
a language acquisition method based on distributional cues 
and proposed a mathematical model for language acquisition. 
Obviously, distributional cues are not enough for more 
accurate language acquisition. As suggested by several 
computational and robotic studies, making use of co-occurrence 
cues improves the accuracy of language acquisition GD-ED. 
The proposed HDP-HLM is a fully probabilistic generative 
model. Therefore, taking other factors into consideration 
is relatively easier than with heuristic models, which is 
another advantage of our approach. Incorporating prosodic and 
co-occurrence cues into the NPB-DAA to obtain a more 
accurate and more plausible constructive developmental 
language acquisition model is also a direction for future research. 